With the data set of your choice, after ensuring the variable(s) you’re exploring are indeed factors, you are expected to:
Drop factor / levels; Reorder levels based on knowledge from data. We’ve elaborated on these steps for the gapminder and singer data sets below.
Be sure to also characterize the (derived) data before and after your factor re-leveling:
Explore the effects of arrange(). Does merely arranging the data have any effect on, say, a figure? Explore the effects of reordering a factor and factor reordering coupled with arrange(). Especially, what effect does this have on a figure? These explorations should involve the data, the factor levels, and some figures.
Elaboration for the gapminder data set
Drop Oceania. Filter the Gapminder data to remove observations associated with the continent of Oceania. Additionally, remove unused factor levels. Provide concrete information on the data before and after removing these rows and Oceania; address the number of rows and the levels of the affected factors.
library(gapminder)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts --------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(knitr)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
is.factor(gapminder$continent)
## [1] TRUE
head(gapminder)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
gapminder%>%
group_by(continent)%>%
summarize(num=n())
## # A tibble: 5 x 2
## continent num
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
Continent is a factor with five levels, and a total of
624+300+396+360+24
## [1] 1704
rows
First, we will look at the data with Oceania still included:
gapminder %>%
summarize(
nrow = nrow(gapminder),
nlevels_continent = nlevels(gapminder$continent),
nlevels_country = nlevels(gapminder$country)) %>%
knitr::kable(col.names = c("Total rows in gapminder", "Levels of continent", "Levels of country"))
| Total rows in gapminder | Levels of continent | Levels of country |
|---|---|---|
| 1704 | 5 | 142 |
Now, let’s see how the rows change if Oceania gets dropped.
gapminder_without_oceania <- gapminder %>%
filter(continent != "Oceania")
gapminder_without_oceania %>%
summarize(
nrow = nrow(gapminder_without_oceania),
nlevels_continent = nlevels(gapminder_without_oceania$continent),
nlevels_country = nlevels(gapminder_without_oceania$country)) %>%
knitr::kable(col.names = c("Total rows in gapminder without Oceania", "Levels of continent without Oceania", "Levels of country without Oceania"))
| Total rows in gapminder without Oceania | Levels of continent without Oceania | Levels of country without Oceania |
|---|---|---|
| 1680 | 5 | 142 |
Let’s look at how many rows each continent has after filtering:
gapminder_without_oceania%>%
group_by(continent)%>%
summarize(num=n())
## # A tibble: 4 x 2
## continent num
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
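Only four continents still have rows (the continent factor itself keeps all five levels, as the table above shows, because filter() does not drop unused levels), for a total of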
624+300+396+360
## [1] 1680
rows
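The task also asks for unused factor levels to be removed, which filter() on its own does not do: the summary table above still reports 5 continent levels and 142 country levels. Below is a minimal sketch of one way to finish the job, assuming base R's droplevels() is acceptable here (forcats::fct_drop() applied to a single factor would be an alternative). The name gap_dropped is made up for illustration.

```r
library(gapminder)
library(dplyr)

# Remove the Oceania rows, then drop every factor level that no longer occurs
gap_dropped <- gapminder %>%
  filter(continent != "Oceania") %>%
  droplevels()

nrow(gap_dropped)               # rows left after filtering
levels(gap_dropped$continent)   # "Oceania" should no longer be listed
nlevels(gap_dropped$continent)  # continent levels after dropping
nlevels(gap_dropped$country)    # country levels after dropping
```

Because droplevels() acts on the whole data frame, it also removes the two Oceania countries from the country factor, not just the Oceania level from continent.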
Reorder the levels of country or continent. Use the forcats package to change the order of the factor levels, based on a principled summary of one of the quantitative variables. Consider experimenting with a summary statistic beyond the most basic choice of the median.
First, let’s look at the standard deviation of life expectancy for each African country:
library(forcats)
gapminder_original_order <- gapminder %>%
filter(continent == "Africa") %>%
group_by(country) %>%
mutate(sd_life = sd(lifeExp)) %>%
select(country, sd_life) %>%
unique() # drop the repeated rows (one per year), otherwise I get an error message
knitr::kable(gapminder_original_order)
| country | sd_life |
|---|---|
| Algeria | 10.340069 |
| Angola | 4.005276 |
| Benin | 6.128681 |
| Botswana | 5.929476 |
| Burkina Faso | 6.845792 |
| Burundi | 3.174882 |
| Cameroon | 5.467960 |
| Central African Republic | 4.720690 |
| Chad | 4.887978 |
| Comoros | 8.132353 |
| Congo, Dem. Rep. | 2.869210 |
| Congo, Rep. | 4.878987 |
| Cote d’Ivoire | 4.421421 |
| Djibouti | 6.710003 |
| Egypt | 10.062500 |
| Equatorial Guinea | 5.600456 |
| Eritrea | 6.903925 |
| Ethiopia | 5.627192 |
| Gabon | 8.933194 |
| Gambia | 10.545929 |
| Ghana | 5.846972 |
| Guinea | 7.743160 |
| Guinea-Bissau | 4.937368 |
| Kenya | 5.596199 |
| Lesotho | 5.914277 |
| Liberia | 2.419094 |
| Libya | 11.372181 |
| Madagascar | 7.297844 |
| Malawi | 4.607323 |
| Mali | 6.808537 |
| Mauritania | 8.057280 |
| Mauritius | 6.497274 |
| Morocco | 9.806162 |
| Mozambique | 4.599184 |
| Namibia | 6.303906 |
| Niger | 6.509444 |
| Nigeria | 4.021207 |
| Reunion | 8.434938 |
| Rwanda | 6.307415 |
| Sao Tome and Principe | 6.283923 |
| Senegal | 9.141934 |
| Sierra Leone | 3.937828 |
| Somalia | 4.503828 |
| South Africa | 5.455502 |
| Sudan | 6.927843 |
| Swaziland | 6.562668 |
| Tanzania | 3.602435 |
| Togo | 7.247043 |
| Tunisia | 10.701244 |
| Uganda | 3.747267 |
| Zambia | 4.453246 |
| Zimbabwe | 7.071816 |
Now, let’s arrange the rows from highest to lowest standard deviation. Note that arrange() only sorts the rows; it does not change the factor levels of country.
gapminder_new_order <- gapminder_original_order %>%
arrange(desc(sd_life))
knitr::kable(gapminder_new_order)
| country | sd_life |
|---|---|
| Libya | 11.372181 |
| Tunisia | 10.701244 |
| Gambia | 10.545929 |
| Algeria | 10.340069 |
| Egypt | 10.062500 |
| Morocco | 9.806162 |
| Senegal | 9.141934 |
| Gabon | 8.933194 |
| Reunion | 8.434938 |
| Comoros | 8.132353 |
| Mauritania | 8.057280 |
| Guinea | 7.743160 |
| Madagascar | 7.297844 |
| Togo | 7.247043 |
| Zimbabwe | 7.071816 |
| Sudan | 6.927843 |
| Eritrea | 6.903925 |
| Burkina Faso | 6.845792 |
| Mali | 6.808537 |
| Djibouti | 6.710003 |
| Swaziland | 6.562668 |
| Niger | 6.509444 |
| Mauritius | 6.497274 |
| Rwanda | 6.307415 |
| Namibia | 6.303906 |
| Sao Tome and Principe | 6.283923 |
| Benin | 6.128681 |
| Botswana | 5.929476 |
| Lesotho | 5.914277 |
| Ghana | 5.846972 |
| Ethiopia | 5.627192 |
| Equatorial Guinea | 5.600456 |
| Kenya | 5.596199 |
| Cameroon | 5.467960 |
| South Africa | 5.455502 |
| Guinea-Bissau | 4.937368 |
| Chad | 4.887978 |
| Congo, Rep. | 4.878987 |
| Central African Republic | 4.720690 |
| Malawi | 4.607323 |
| Mozambique | 4.599184 |
| Somalia | 4.503828 |
| Zambia | 4.453246 |
| Cote d’Ivoire | 4.421421 |
| Nigeria | 4.021207 |
| Angola | 4.005276 |
| Sierra Leone | 3.937828 |
| Uganda | 3.747267 |
| Tanzania | 3.602435 |
| Burundi | 3.174882 |
| Congo, Dem. Rep. | 2.869210 |
| Liberia | 2.419094 |
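The two tables differ only in row order. A quick way to check what arrange() does to the factor itself, compared with fct_reorder(), is sketched below. It recomputes the per-country standard deviation with summarize() rather than mutate() plus unique(), and drops unused levels first so that only African countries remain in the factor. The object names africa_sd, arranged and releveled are illustrative only.

```r
library(gapminder)
library(dplyr)
library(forcats)

# One row per African country with the sd of life expectancy
africa_sd <- gapminder %>%
  filter(continent == "Africa") %>%
  droplevels() %>%
  group_by(country) %>%
  summarize(sd_life = sd(lifeExp))

# arrange() sorts the rows but leaves the levels alphabetical
arranged <- africa_sd %>%
  arrange(desc(sd_life))
head(levels(arranged$country), 3)

# fct_reorder() changes the levels themselves (highest sd first here)
releveled <- africa_sd %>%
  mutate(country = fct_reorder(country, sd_life, .desc = TRUE))
head(levels(releveled$country), 3)
```

Since ggplot2 orders a discrete axis by factor levels rather than by row order, only the second version would change how a figure looks.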
Now let’s look at a figure. Here we look at GDP per capita by country in 2007.
gap_2007 <- gapminder %>%
filter(year == 2007)
ggplot(gap_2007, aes(gdpPercap, country)) + geom_point()+
xlab( "GDP per capita") +
ylab( "Country" ) +
ggtitle( "GDP per capita by country in 2007 unsorted" ) +
theme_light()
Unfortunately, the countries appear in the default alphabetical order of the factor levels, which makes the plot hard to read. Now let’s reorder them by GDP per capita.
#Let's use `fct_reorder()` to reorder the countries by gdp per capita, and produce the same plot:
gap_2007 %>%
mutate(country = fct_reorder(country, gdpPercap)) %>%
ggplot(aes(gdpPercap, country)) + geom_point()+
xlab( "GDP per capita") +
ylab( "Country" ) +
ggtitle( "GDP per capita by country in 2007" ) +
theme_light()
Part 2: File I/O
Experiment with one or more of write_csv()/read_csv() (and/or TSV friends), saveRDS()/readRDS(), dput()/dget(). Create something new, probably by filtering or grouped-summarization of Singer or Gapminder. I highly recommend you fiddle with the factor levels, i.e. make them non-alphabetical (see previous section). Explore whether this survives the round trip of writing to file then reading back in.
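First I reorder country by GDP per capita and export the result to a csv file. (The reordering in the plot above was never saved back into gap_2007, so it has to be applied again here.)
gap_2007 %>% mutate(country = fct_reorder(country, gdpPercap)) %>%
  write_csv("gap_2007.csv")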
Now, let’s read the file back in and see whether the same plot as above is still ordered by GDP per capita.
read_csv("gap_2007.csv") %>% #import .csv
ggplot(aes(gdpPercap, country)) + geom_point()+
xlab( "GDP per capita") +
ylab( "Country" ) +
ggtitle( "GDP per capita by country in 2007" ) +
theme_light()
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_integer(),
## lifeExp = col_double(),
## pop = col_integer(),
## gdpPercap = col_double()
## )
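As we can see, it isn’t: read_csv() reads country back as a character column (col_character() in the column specification above), so no factor level order can survive the CSV round trip and the plot falls back to alphabetical order.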
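To compare this with a format that does keep factor information, here is a small sketch that sends a reordered data frame through both write_csv()/read_csv() and saveRDS()/readRDS(). It writes to temporary files so nothing in the project folder is touched. The names gap_2007_reordered, from_csv and from_rds are made up for this example.

```r
library(gapminder)
library(dplyr)
library(forcats)
library(readr)

# 2007 data with country levels ordered by GDP per capita
gap_2007_reordered <- gapminder %>%
  filter(year == 2007) %>%
  mutate(country = fct_reorder(country, gdpPercap))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(gap_2007_reordered, csv_path)
saveRDS(gap_2007_reordered, rds_path)

# suppressMessages() hides the column specification message
from_csv <- suppressMessages(read_csv(csv_path))
from_rds <- readRDS(rds_path)

class(from_csv$country)  # plain text after the CSV round trip
class(from_rds$country)  # still a factor after the RDS round trip
identical(levels(from_rds$country), levels(gap_2007_reordered$country))
```

If the level order needs to survive, the RDS route looks like the simpler choice; with CSV the factor would have to be rebuilt and reordered after reading.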
Part 3: Visualization design
Remake at least one figure or create a new one, in light of something you learned in the recent class meetings about visualization design and color. Maybe juxtapose your first attempt and what you obtained after some time spent working on it. Reflect on the differences. If using Gapminder, you can use the country or continent color scheme that ships with Gapminder. Consult the dimensions listed in All the Graph Things.
Then, make a new graph by converting this visual (or another, if you’d like) to a plotly graph. What are some things that plotly makes possible, that are not possible with a regular ggplot2 graph?
Spread of GDP per cap by year by continent
Now let’s look at a graph that I made for a previous homework assignment and compare the ggplot version with the plotly version.
gdp.2 <- gapminder %>%
group_by(continent, year) %>%
summarize(Std.Deviation = sd(gdpPercap),
Variance = var(gdpPercap))
ggplot(gdp.2, aes(year)) +
geom_line(aes(y=Std.Deviation, color=continent)) +
scale_size_area()
gdp_spread <- ggplot(gdp.2, aes(year)) +
geom_line(aes(y=Std.Deviation, color=continent)) +
scale_size_area()
ggplotly(gdp_spread)
We can also look at other functions that can make use of ggplot (such as visreg) and see if they can likewise be converted into plotly.
In this example, I’m looking at an interaction between population and GDP per capita in predicting life expectancy for the year 2007. (Note: This makes little sense theoretically and as we can see the interaction term is not significant, but it serves to illustrate the possibilities of plotly.)
gap_2007 <- gapminder %>%
filter(year == 2007)
m1 <- lm(lifeExp ~ gdpPercap*pop, data=gap_2007)
summary(m1)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap * pop, data = gap_2007)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.466 -5.910 1.877 6.942 13.393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.910e+01 1.056e+00 55.947 <2e-16 ***
## gdpPercap 6.575e-04 6.393e-05 10.284 <2e-16 ***
## pop 9.386e-09 6.428e-09 1.460 0.146
## gdpPercap:pop -4.595e-13 7.586e-13 -0.606 0.546
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.89 on 138 degrees of freedom
## Multiple R-squared: 0.4693, Adjusted R-squared: 0.4578
## F-statistic: 40.68 on 3 and 138 DF, p-value: < 2.2e-16
psych::describe(gap_2007$pop) #extract mean and sd to look at population as the moderator and define three levels (mean and +/- 1sd)
## vars n mean sd median trimmed mad min
## X1 1 142 44021220 147621398 10517531 18112754 12212490 199579
## max range skew kurtosis se
## X1 1318683096 1318483517 7.25 55.42 12388113
library(visreg)
visreg(m1, "gdpPercap", by="pop", breaks=c(-103600178,44021220,191642618), overlay=TRUE,
band=FALSE, ylab="Life Expectancy", xlab="GDP per capita",
bty="n", partial=FALSE, rug=FALSE, gg=TRUE)
gap_interaction <-visreg(m1, "gdpPercap", by="pop", breaks=c(-103600178,44021220,191642618), overlay=TRUE,
band=FALSE, ylab="Life Expectancy", xlab="GDP per capita",
bty="n", partial=FALSE, rug=FALSE, gg=TRUE)
ggplotly(gap_interaction)
Note: the plotly graphs won’t render in this file, so you have to check them out separately. Try hovering over them; they have many useful functions, such as zooming in and out.
Part 4: Writing figures to file
Use ggsave() to explicitly save a plot to file. Then load and embed it in your report. You can play around with various options, such as:
Arguments of ggsave(), such as width, height, resolution or text scaling. Various graphics devices, e.g. a vector vs. raster format. Explicit provision of the plot object p via ggsave(…, plot = p). Show a situation in which this actually matters.
ggsave("gap_interaction.png", gap_interaction, width=40, height=40, units = "cm", device = 'png')
ggsave() wrote this graph to the current working directory, which for me is my Homework 5 folder on my laptop.
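The task also asks for a situation in which passing the plot object explicitly matters. A minimal sketch, under the assumption that two plots are built before saving: ggsave() defaults to last_plot(), so without plot = it would save whichever plot was made most recently, not necessarily the intended one. The plot names p_life and p_gdp and the file names are hypothetical, and the files go to a temporary directory.

```r
library(ggplot2)
library(gapminder)

# Two plots; p_gdp is created last, so it is what last_plot() would return
p_life <- ggplot(gapminder, aes(year, lifeExp)) + geom_point()
p_gdp  <- ggplot(gapminder, aes(year, gdpPercap)) + geom_point()

out_dir <- tempdir()
raster_file <- file.path(out_dir, "life_by_year.png")
vector_file <- file.path(out_dir, "life_by_year.pdf")

# plot = p_life makes sure the earlier plot is the one that gets saved
ggsave(raster_file, plot = p_life, width = 15, height = 10, units = "cm", dpi = 300)
ggsave(vector_file, plot = p_life, width = 15, height = 10, units = "cm")

file.exists(c(raster_file, vector_file))
```

The PNG is a raster image whose sharpness depends on dpi, while the PDF is a vector format that scales without pixelation; ggsave() picks the device from the file extension.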